25 research outputs found

    Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

    Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because recent deep learning approaches have less need for feature engineering, but instead more need for large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.

    Inspector Gadget: A Data Programming-based Labeling System for Industrial Images

    As machine learning for images becomes democratized in the Software 2.0 era, one of the serious bottlenecks is securing enough labeled data for training. This problem is especially critical in a manufacturing setting where smart factories rely on machine learning for product quality control by analyzing industrial images. Such images are typically large and may only need to be partially analyzed where only a small portion is problematic (e.g., identifying defects on a surface). Since manually labeling these images is expensive, weak supervision is an attractive alternative where the idea is to generate weak labels that are not perfect, but can be produced at scale. Data programming is a recent paradigm in this category that uses human knowledge in the form of labeling functions and combines them into a generative model. Data programming has been successful in applications based on text or structured data and can also be applied to images, usually if one can find a way to convert them into structured data. In this work, we expand the horizon of data programming by directly applying it to images without this conversion, which is a common scenario for industrial applications. We propose Inspector Gadget, an image labeling system that combines crowdsourcing, data augmentation, and data programming to produce weak labels at scale for image classification. We perform experiments on real industrial image datasets and show that Inspector Gadget obtains better performance than other weak-labeling techniques: Snuba, GOGGLES, and self-learning baselines using convolutional neural networks (CNNs) without pre-training. Comment: 10 pages, 11 figures

    Multidrug-Resistant Acinetobacter spp.: Increasingly Problematic Nosocomial Pathogens

    Pathogenic bacteria have become increasingly resistant to antimicrobial therapy. Recently, the resistance problem has worsened considerably in Gram-negative bacilli. Acinetobacter spp. are typical nosocomial pathogens causing infections and high mortality, almost exclusively in compromised hospital patients. Acinetobacter spp. are intrinsically less susceptible to antibiotics than Enterobacteriaceae, and have a propensity to acquire resistance. A surveillance study in Korea in 2009 showed that resistance rates of Acinetobacter spp. were very high: to fluoroquinolone 67%, to amikacin 48%, to ceftazidime 66% and to imipenem 51%. Carbapenem resistance was mostly due to OXA type carbapenemase production in A. baumannii isolates, whereas it was due to metallo-β-lactamase production in non-baumannii Acinetobacter isolates. Colistin-resistant isolates were rare but have started to be isolated in Korea. Currently, infection caused by multidrug-resistant A. baumannii is among the most difficult to treat. Analysis at a tertiary care hospital in 2010 showed that among the 1,085 isolates of Acinetobacter spp., 14.9% and 41.8% were resistant to seven and to all eight antimicrobial agents tested, respectively. Acinetobacter spp. infection is known to be difficult to prevent in hospitalized patients, because the organisms are ubiquitous in the hospital environment. Efforts to control resistant bacteria in Korea by hospitals, relevant scientific societies and government agencies have been only partially successful. We need concerted multidisciplinary efforts to preserve the efficacy of currently available antimicrobial agents, by following the principles of antimicrobial stewardship.

    Decline in subarachnoid haemorrhage volumes associated with the first wave of the COVID-19 pandemic

    BACKGROUND: During the COVID-19 pandemic, decreased volumes of stroke admissions and mechanical thrombectomy were reported. The study's objective was to examine whether subarachnoid haemorrhage (SAH) hospitalisations and ruptured aneurysm coiling interventions demonstrated similar declines. METHODS: We conducted a cross-sectional, retrospective, observational study across 6 continents, 37 countries and 140 comprehensive stroke centres. Patients with the diagnosis of SAH, aneurysmal SAH, ruptured aneurysm coiling interventions and COVID-19 were identified by prospective aneurysm databases or by International Classification of Diseases, 10th Revision, codes. The 3-month cumulative volume and monthly volumes for SAH hospitalisations and ruptured aneurysm coiling procedures were compared for the period before (1 year and immediately before) and during the pandemic, defined as 1 March-31 May 2020. The prior 1-year control period (1 March-31 May 2019) was obtained to account for seasonal variation. FINDINGS: There was a significant decline in SAH hospitalisations, with 2044 admissions in the 3 months immediately before and 1585 admissions during the pandemic, representing a relative decline of 22.5% (95% CI -24.3% to -20.7%, p<0.0001). Embolisation of ruptured aneurysms declined from 1170 to 1035 procedures, representing an 11.5% (95% CI -13.5% to -9.8%, p=0.002) relative drop. Subgroup analysis showed a decline in aneurysmal SAH hospitalisations from 834 to 626, a 24.9% relative decline (95% CI -28.0% to -22.1%, p<0.0001). A relative increase in ruptured aneurysm coiling was noted in low coiling volume hospitals of 41.1% (95% CI 32.3% to 50.6%, p=0.008) despite a decrease in SAH admissions in this tertile. INTERPRETATION: There was a relative decrease in the volume of SAH hospitalisations, aneurysmal SAH hospitalisations and ruptured aneurysm embolisations during the COVID-19 pandemic. These findings in SAH are consistent with a decrease in other emergencies, such as stroke and myocardial infarction.